-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAP-05 ping-pong #1811
DAP-05 ping-pong #1811
Conversation
41fd0b6
to
c5b91f8
Compare
This is ready for review. Sadly the diff is large because the changes to test support code (like |
aggregator/src/aggregator/error.rs
Outdated
ref error @ PingPongError::StateMismatch(_, _) => ( | ||
format!("{error}"), | ||
format!("{peer_role}_ping_pong_message_state_mismatch"), | ||
// TODO(timg): is this the right error if state mismatch? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be resolved: VdafPrepError
is correct in this case, because the DAP text says that any time a VDAF-level ping-pong routine yields state Rejected()
, the aggregator should construct a PrepareResp
with VdafPrepError
. So for that reason, the cases above where we return UnrecognizedMessage
need to be fixed. It's a shame, because in libprio we actually have specific knowledge of what went wrong, but per the specification we have to discard that information and flatten everything into VdafPrepError
. At least we can still increment a specific Prometheus metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the PR description:
The existing implementation would store all three objects in the database, significantly increasing the per-report storage requirements.
What existing implementation is this referring to? The pre-ping-pong implementation didn't store all three objects (since only two objects were relevant), and the experimental ping-pong implementation didn't store all three objects (since it implemented something like the same optimization we now implement by splitting continued
to output a Transition
which has an evaluate
which finishes the continuation process).
Cargo.toml
Outdated
@@ -42,7 +42,9 @@ janus_messages = { version = "0.5", path = "messages" } | |||
k8s-openapi = { version = "0.18.0", features = ["v1_24"] } # keep this version in sync with what is referenced by the indirect dependency via `kube` | |||
kube = { version = "0.82.2", default-features = false, features = ["client", "rustls-tls"] } | |||
opentelemetry = { version = "0.20", features = ["metrics"] } | |||
prio = { version = "0.14.1", features = ["multithreaded"] } | |||
# TODO(timg): go back to a released version of prio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder: need to resolve this TODO before merging (which will require merging divviup/libprio-rs#683 & cutting a libprio-rs release)
messages/Cargo.toml
Outdated
@@ -20,7 +20,9 @@ hex = "0.4" | |||
num_enum = "0.7.0" | |||
# We can't pull prio in from the workspace because that would enable default features, and we do not | |||
# want prio/crypto-dependencies | |||
prio = { version = "0.14.1", features = ["multithreaded"] } | |||
# TODO(timg): go back to a released version of prio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder: need to resolve this TODO before merging (which will require merging divviup/libprio-rs#683 & cutting a libprio-rs release)
"Existing" is a poor word choice on my part. I think I meant "the implementation I had before I introduced |
90efb46
to
384ed5e
Compare
I've addressed the outstanding comments here, but this needs further changes to deal with whatever we end up doing about |
There are merge conflicts I need to clean up here. Also, there have been some changes to libprio-rs that are causing compilation issues. A lot of that has nothing to do with ping-pong, so I'm going to make a new PR that adopts libprio 0.15 (which will have most of VDAF-07 in it), then rebase this change on that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only one comment left, but it might require some refactoring. Getting ideal (minimal) storage & compute is tricky with this abstraction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a few final nitpicky comments.
ReportAggregationState::Finished => saw_finished = true, | ||
ReportAggregationState::Failed(_) => (), // ignore failed aggregations | ||
_ => (), // ignore failed aggregations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return an error (or maybe panic, though I think returning an error is wiser) if we see WaitingHelper
in Leader code or vice-versa?
fff1d93
to
07d8ce8
Compare
This change implements the DAP-05 ping-pong topology in which aggregators take turns preprocessing prepare shares into prepare messages. While this topology first appeared in DAP-05, this implementation follows DAP-06. This change depends on the implementation of the VDAF ping-pong topology added to crate `prio` in [1], which in turn conforms to the specification in VDAF-07. In the ping-pong topology, each DAP-layer step of aggregation now spans two VDAF rounds. An aggregator will use the prepare message it gets from its peer to advance by one VDAF round, and then can use the prepare share it just computed along with the peer's prepare share to advance by another. This incurs some changes to what intermediate values are stored by aggregators. In the case where a leader is continuing/waiting, it will have computed a prepare state, a prepare message for the current round and a prepare share for the next round. The naive implementation would store all three objects in the database, significantly increasing the per-report storage use. To mitigate this, the leader stores a `prio::topology::ping_pong::PingPongTransition`, which will contain a prepare state and a prepare message (both of which are generally much smaller than prepare shares), from which the next prepare state and importantly prepare share can be recomputed. On the helper side, there's no way around storing the prepare share: we store the most recently computed `PrepareResp` so that we can handle aggregation jobs idempotently. But to avoid storing prepare messages twice, the continuing/waiting helper stores just a prepare state and a `PingPongMessage`. The main benefit of this change is to reduce how many round trips between aggregators are needed to prepare reports. Quite a few tests used Prio3 but depended on having the leader or helper in the `Waiting` state after running aggregation initialization. Accordingly, those tests are changed to run Poplar1, which now takes 2 rounds. [1]: divviup/libprio-rs#683 Co-authored-by: Tim Geoghegan <[email protected]> Part of #1669
9e74231
to
a155d81
Compare
This change implements the DAP-05 ping-pong topology in which aggregators take turns preprocessing prepare shares into prepare messages. While this topology first appeared in DAP-05, this implementation follows the changes in 1, which should appear in DAP-06.
This change depends on the implementation of the VDAF ping-pong topology added to crate
prio
in 2, which in turn conforms to the specification added after VDAF-06 and further tweaked in 3 (we expect this to be published soon as VDAF-07).This commit makes some changes to what intermediate values are stored by aggregators. In the case where an aggregator is continuing, it will have computed a prepare state, a prepare message for the current round and a
prepare share for the next round. The existing implementation would store all three objects in the database, significantly increasing the per-report storage requirements. In particular, this makes things worse for the Helper, which previously never needed to store a prepare share because the Leader always took responsibility for combining prepare
shares.
To mitigate this, we instead have aggregators store a
prio::ping_pong::topology::Transition
, which will contain a preparestate and a prepare message (both of which are generally much smaller than prepare shares), from which the next prepare state and importantly prepare share can be recomputed.
The main benefit of this change is to reduce how many round trips between aggregators are needed to prepare reports. Quite a few tests used Prio3 but depended on having the leader or helper in the
Waiting
state after running aggregation initialization. Accordingly, those tests are changed to run Poplar1, which now takes 2 rounds.Part of #1669